Class-Based TF-IDF
Class-Based TF-IDF
Class-based TF-IDF is a new topic representation method proposed in BERTopic.
Normal TF-IDF calculates the importance of a word in each document. On the other hand, class-based TF-IDF groups documents by topic and considers them as one pseudo document, and calculates the importance of words in that document class.
Specifically, it is calculated as follows
Wt,c = tft,c · log(1 + A/tft)
tft,c : frequency of occurrence of the word t in documents contained in topic c
A : Average frequency of words in the corpus
tft : Frequency of occurrence of word t in the whole corpus
While regular TF-IDF treats each document independently, class-based TF-IDF is unique in that it groups documents by topic. This allows for a more direct calculation of the importance of words that characterize a topic.
BERTopic states that using this class-based TF-IDF for generating topic representations has yielded better results than traditional cluster-centric-based topic representations. It can be said that this method contributes to improving the interpretability of topics.
---
This page is auto-translated from /nishio/クラスベースTF-IDF using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thought to non-Japanese readers.